LEC-12910


A STATISTICAL TEST PROCEDURE FOR DETECTING 
MULTIPLE OUTLIERS IN A DATA SET 

1. INTRODUCTION 

Data contamination is a fairly common problem. When analyzing data, it is 
sometimes desirable to examine whether or not observations have come from 
the same distribution and to detect the potential outliers in the data. So 
far, any solution to this problem has been limited to the detection of a 
specified number of outliers. The number of outliers is generally unknown 
and cannot be specified in advance. The major drawbacks of testing for a 
fixed number of outliers are discussed in reference 1. 

The sample size is an important influence on the number of observations 
likely to be outliers. It is reasonable to think of outliers as a minority; 
hence, not more than 50 percent of a set of observations can be outliers. 

It has been suggested (ref. 1) that a certain percentage of the data 
should be considered for potential outliers; however, the percentage should 
be variable rather than fixed. For example, it is reasonable to consider 
3 potential outliers in a data set of 10 observations, but it is unrealistic 
to expect 30 outliers out of a data set of 100 observations. In the latter 
case, the outlier detection problem becomes one of discrimination between 
two or more classes of data. 

Several test statistics have been suggested for the significance test for 
detecting outliers (refs. 1 through 4). Among all the suggested tests, the 
extreme studentized deviate (ESD) test procedure is most favored because it 
has been shown to have more power against certain alternatives for the number 
of outliers and their distributions and because it is computationally 
simple. The problem of obtaining the joint distribution of the ESD test 
statistics for multiple-outlier detection is still intractable; however, 
certain percentage points in the case of testing for the existence of 
specifically one or two outliers have been obtained for the test statistics 
using the Monte Carlo technique (ref. 1). 

In the present Monte Carlo analysis, Rosner's table (ref. 1) is extended to 
include percentage points for the 5-percent significance level for testing 
as many as 19 outliers in a data set for which the primary distribution is 
normal. Also given are certain empirically developed relationships which 
can be used to easily obtain the percentage points of the ESD test statistics 
for various combinations of sample size and the hypothetical number of 
outliers. 

2. THE TEST PROCEDURE 


The test procedure based on ESD statistics is described in this section. 
Following Rosner (ref. 1), who has given a general formulation of many 
outlier test procedures, let $x_1, x_2, \ldots, x_n$ be the observations and 
let $k$ be the maximum number of possible outliers. Consider the sequence of 
subsets $A_0, A_1, \ldots, A_k$, where $A_0 = \{x_1, x_2, \ldots, x_n\}$ and 
$A_{i+1} = A_i - \{x^{(i)}\}$, $i = 0, 1, 2, \ldots, k-1$, where $x^{(i)}$ 
is defined by 

$$|x^{(i)} - \bar{x}(A_i)| = \max_{x_j \in A_i} |x_j - \bar{x}(A_i)|, \qquad \bar{x}(A_i) = \frac{1}{n-i} \sum_{x_j \in A_i} x_j$$

Thus, $A_k \subset \cdots \subset A_1 \subset A_0$, and $A_i$ is obtained by 
deleting from $A_{i-1}$ the data point farthest away from the mean of 
$A_{i-1}$. The test statistic $t(A_{i+1})$ applied to assess the significance 
of the most outlying observation in $A_i$ is defined by 

$$t(A_{i+1}) = \max_{x_j \in A_i} \frac{|x_j - \bar{x}(A_i)|}{S_i} \tag{1}$$

where 

$$S_i^2 = \frac{1}{n-i-1} \sum_{x_j \in A_i} [x_j - \bar{x}(A_i)]^2, \qquad i = 0, 1, 2, \ldots, k-1$$
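
The computation of the ESD statistics in equation (1) can be sketched in Python as follows. This is a minimal illustration, not the report's own program: the function name `esd_statistics`, its return layout, and the tie-breaking rule (the first of several equally distant points is deleted) are assumptions made for the example.

```python
import math


def esd_statistics(data, k):
    """Return ([t(A_1), ..., t(A_k)], [x^(0), ..., x^(k-1)]) for one sample.

    At step i the current subset A_i holds n - i points; the point farthest
    from the subset mean is scored by equation (1) and then deleted.
    """
    subset = list(data)            # A_0: a copy, so the caller's data is untouched
    stats, removed = [], []
    for _ in range(k):
        m = len(subset)            # m = n - i
        if m < 2:
            raise ValueError("k is too large for this sample size")
        mean = sum(subset) / m
        # S_i uses the divisor n - i - 1 = m - 1
        s = math.sqrt(sum((x - mean) ** 2 for x in subset) / (m - 1))
        if s == 0.0:
            raise ValueError("remaining observations are all identical")
        # index of the observation farthest from the mean (first one on ties)
        j = max(range(m), key=lambda idx: abs(subset[idx] - mean))
        stats.append(abs(subset[j] - mean) / s)
        removed.append(subset.pop(j))
    return stats, removed


stats, removed = esd_statistics([1.0, 2.0, 3.0, 4.0, 100.0], k=2)
print(removed[0])   # 100.0, the point farthest from the mean of A_0
```

The second list records the deleted points in order, which is what the decision rule needs in order to report the declared outliers.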




Considering $H_0$, the hypothesis that no outlier is present in the data, the 
significance test procedure is to reject the hypothesis and declare that 
the data contain some outliers if 

$$t(A_i) > \lambda_i \quad \text{for some } i \in \{1, 2, \ldots, k\} \tag{2}$$

where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are determined by 

$$\text{Prob}\left\{ \bigcup_{i=1}^{k} [t(A_i) > \lambda_i] \right\} = \alpha \tag{3}$$

with $\alpha$ as the desired significance level. Clearly, the determination of 
$\lambda_i$ requires knowing the joint distribution of $t(A_1), t(A_2), \ldots, t(A_k)$ 
under $H_0$. Moreover, the solution of equation (3) forms a hyperplane and thus 
there is no unique set of $\lambda_i$ satisfying equation (3). To achieve a 
unique solution, another requirement besides equation (3) is introduced by 
considering critical regions to have a fixed significance level $\beta$ in each 
dimension. As such, the procedure is to find $\beta$ and $\lambda_{ik}(\beta)$, 
$i = 1, 2, \ldots, k$, so that, given $H_0$, 

$$\text{Prob}\{t(A_i) > \lambda_{ik}(\beta)\} = \beta \tag{4}$$

and 

$$\text{Prob}\left\{ \bigcup_{i=1}^{k} [t(A_i) > \lambda_{ik}(\beta)] \right\} = \alpha \tag{5}$$


If $t(A_i) < \lambda_{ik}(\beta)$ simultaneously for all $i$, no outliers in the 
data are declared. Otherwise, if $t(A_i) > \lambda_{ik}(\beta)$ holds for at 
least one $i$ and 

$$m = \max_{i = 1, 2, \ldots, k} \{ i : t(A_i) > \lambda_{ik}(\beta) \}$$

then $x^{(0)}, x^{(1)}, \ldots, x^{(m-1)}$ are declared as outliers; i.e., the 
data points excluded to form $A_m$ are declared as outliers. 
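
The decision rule above can be sketched in Python as follows. The function name `declare_outliers` and its argument layout are hypothetical, and the critical values in the usage line are made-up numbers chosen only to illustrate the rule; they are not taken from table I.

```python
def declare_outliers(stats, critical_values, removed):
    """Apply the rule m = max{i : t(A_i) > lambda_ik(beta)}.

    stats[i-1]           is t(A_i), i = 1..k
    critical_values[i-1] is lambda_ik(beta)
    removed[i-1]         is x^(i-1), the point deleted to form A_i
    Returns the list of points declared as outliers (empty if none).
    """
    if not (len(stats) == len(critical_values) == len(removed)):
        raise ValueError("all three sequences must have length k")
    m = 0
    for i, (t, lam) in enumerate(zip(stats, critical_values), start=1):
        if t > lam:
            m = i                  # keep the largest index that is significant
    return removed[:m]             # x^(0), ..., x^(m-1)


# Illustrative numbers only: t(A_1) is not significant but t(A_2) is,
# so both deleted points are declared outliers.
print(declare_outliers([2.0, 3.5, 1.0], [3.0, 3.0, 3.0], [9.1, -7.2, 5.0]))
# [9.1, -7.2]
```

Taking the largest significant index, rather than stopping at the first non-significant statistic, is what lets the procedure declare a group of outliers even when the first statistic alone does not exceed its critical value.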


It may be pointed out that although $\beta$ in equation (4) is chosen 
independently of $i$, it can be considered different for different outliers. 
Unless there is a specific reason, it is appropriate to attach equal 
significance to different outliers and to have the same $\beta$ for all $i$ in 
equation (4). 


3. DETERMINATION OF CRITICAL VALUES FOR $t(A_i)$ 

The joint distribution of $t(A_i)$, $i = 1, 2, \ldots, k$, is needed to obtain 
the critical values of $\lambda_{ik}(\beta)$ for equation (5). No exact 
derivation of the distribution is possible; instead, the critical values for 
a 5-percent significance level are evaluated by the Monte Carlo procedure. 
For each sample size n = 3(1)7, 10(5)30(10)100, 1000 samples of ordered 
normally distributed observations were generated. Considering detection of 
$k$ outliers, where $k = 1, 2, \ldots$, with $k \le 19$, the empirical 
distribution of $t(A_i)$ was obtained with intervals of probability of size 
0.001, enabling us to find $\beta$ and $\lambda_{ik}(\beta)$ such that 

$$\text{Prob}[t(A_i) > \lambda_{ik}(\beta)] = \beta, \qquad i = 1, 2, \ldots, k$$

and 

$$\text{Prob}\left\{ \bigcup_{i=1}^{k} [t(A_i) > \lambda_{ik}(\beta)] \right\} = 0.05$$
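
A rough Python sketch of such a Monte Carlo search is given below. It is an illustration under stated assumptions, not the program used for table I: the function names, the random number generator, the seed, and the rule of picking the grid value of the per-statistic level whose joint exceedance rate is closest to the target are all choices made for the example. Standard normal samples are used because the ESD statistics are unchanged by shifts and rescalings of the data.

```python
import math
import random


def esd_stats(sample, k):
    """Return [t(A_1), ..., t(A_k)], deleting the most extreme point each step."""
    subset = list(sample)
    stats = []
    for _ in range(k):
        m = len(subset)
        mean = sum(subset) / m
        s = math.sqrt(sum((x - mean) ** 2 for x in subset) / (m - 1))
        j = max(range(m), key=lambda idx: abs(subset[idx] - mean))
        stats.append(abs(subset[j] - mean) / s)
        subset.pop(j)
    return stats


def monte_carlo_critical_values(n, k, alpha=0.05, n_samples=1000, seed=0):
    """Search beta on a 0.001 grid; return (beta, [lambda_1k..lambda_kk], rate)."""
    if not 1 <= k < n:
        raise ValueError("need 1 <= k < n")
    rng = random.Random(seed)
    sims = [esd_stats([rng.gauss(0.0, 1.0) for _ in range(n)], k)
            for _ in range(n_samples)]
    # empirical distribution of each t(A_i): one sorted column per statistic
    columns = [sorted(row[i] for row in sims) for i in range(k)]

    best = None
    for step in range(1, round(alpha * 1000) + 1):      # beta = 0.001, 0.002, ...
        exceed = round(step * n_samples / 1000)         # samples allowed above lambda
        lams = [col[n_samples - 1 - exceed] for col in columns]
        # joint rate: fraction of samples in which any t(A_i) exceeds its lambda
        rate = sum(any(t > lam for t, lam in zip(row, lams))
                   for row in sims) / n_samples
        if best is None or abs(rate - alpha) < abs(best[2] - alpha):
            best = (step / 1000, lams, rate)
    return best


beta, lams, rate = monte_carlo_critical_values(n=10, k=3, seed=1)
print(beta, [round(v, 2) for v in lams], rate)
```

With only 1000 samples per sample size, the estimated critical values carry noticeable sampling error, so results from a sketch like this should be expected to vary with the seed.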

The critical values of $\lambda_{ik}(\beta)$ are presented in table I. For 
k = 2, these values are also given in reference 1. A comparison between the 
values given here and the corresponding values given in table 7, reference 1, 
indicates a difference at the second decimal place. Since a larger number of 
samples was used in the computations by Rosner (ref. 1), the critical values 
given in his paper for the case of k = 2 should be regarded as being more 
accurate. 

In order to obtain the critical values for the 5-percent significance 
level for any combination (n, k), where $n \le 100$ and $k \le 19$, a model was 
empirically developed for $\lambda_{ik}(\beta)$ as a function of n and k. A 
least-squares fit led to the following equation for approximating the 
critical values. 

\kn ' Nk* ®ik 0 i i = 2. — . I' (6) 

where values for $A_{ik}$ and $B_{ik}$ are given in tables II and III, 
respectively. Although equation (6) can be used to obtain approximate 
critical values, the relative difference between an approximated value and 
the actual value could be as high as 5 percent. When the outliers are 
pronounced, it is safe to use equation (6) to approximate the critical values 
of the significance test. On the other hand, more accurate critical values 
may need to be computed and used to detect outliers that are less pronounced. 
Considering that errors in the distributional assumption could invalidate 
the test even if its exact critical values are known, equation (6) can be 
regarded as a practical solution to the problem of determining the critical 
values of a significance test at the 5-percent level for detecting as many 
as 19 outliers in a data set. 

TABLE I.- CRITICAL VALUES AT 5-PERCENT SIGNIFICANCE LEVEL 